Pandas Intro

Pandas stands for python annotated n-dimensional arrays, which describes its functionality pretty well. Pandas is a library for reading, changing, creating and analysing data. Thanks to Pandas being built on numpy it also has very high performance.

import numpy as np
import pandas as pd

Pandas consists of mainly 2 data structures, Series and Dataframes.

Series

A Series is a 1D labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.) however it can only contain one data type at a time. A Series also has an index, a list of axis labels or identifiers, which you can see on the left.

In simple words you can imagine a Series as a column.

From List

series_list = pd.Series([1, 2, 3, 4, 5, 6, 7])
series_list

  1
  2
  3
  4
  5
  6
  7
dtype: int64

type(series_list)

pandas.core.series.Series

Like a numpy array, a pandas Series has a dtype. This is often a numpy dtype but pandas has also added a few of its own.

series_list.dtype

dtype('int64')

Also just like a numpy array, you can look at the shape i.e the dimensions of a Series/DataFrame.

series_string = pd.Series(["Bob", "John", "Joe"])
series_string.shape

(3,)

series_np = pd.Series(np.array([10, 20, 30, 40, 50, 60]))
series_np

  10
  20
  30
  40
  50
  60
dtype: int64

From Dictionary

series_dict = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0})
series_dict

a    0.0
b    1.0
c    2.0
dtype: float64

Custom indexes

Pandas generates by default an index for us which is a sequence of integers but this can be changed.

indexes = np.arange(0, 5, 1)
series_index = pd.Series(
    [2 ** x for x in indexes],
    index=indexes
)
series_index

   1
   2
   4
   8
  16
dtype: int64

series_string_index = pd.Series(np.arange(1, 6, 1), index=["a", "b", "c", "d", "e"])
series_string_index

a    1
b    2
c    3
d    4
e    5
dtype: int64

Iterating over Series

Just like most other data structures in python, it is possible to iterate over them using simple for loops

for value in series_string_index:
    print(value)

for key in series_string_index.keys():  # key being the row index
    print(key)

a
b
c
d
e

for index in series_string_index.index:
    print(index)

a
b
c
d
e

DataFrame

A DataFrame is a 2D labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series/List objects.

If we consider column names as keys and lists of items under that column as values, we can easily use a python dict to represent a DataFrame. Just like with the Series if we haven’t provided a row index it automatically generates a sequence of integers.

From dict of lists/numpy arrays

my_dict = {
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "age": [20, 27, 35, 55, 18, 21, 35],
    "designation": ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}

df_dict = pd.DataFrame(my_dict)
df_dict

	name	age	designation
0	a	20	VP
1	b	27	CEO
2	c	35	CFO
3	d	55	VP
4	e	18	VP
5	f	21	CEO
6	g	35	MD

df_index = pd.DataFrame(
    my_dict,
    index=["First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"]
)
df_index

	name	age	designation
First	a	20	VP
Second	b	27	CEO
Third	c	35	CFO
Fourth	d	55	VP
Fifth	e	18	VP
Sixth	f	21	CEO
Seventh	g	35	MD

From CSV

df_cars = pd.read_csv("data/cars.csv", sep=";", skiprows=[1])
df_cars

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
0	Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
1	Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
2	Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
3	AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
4	Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US
...	...	...	...	...	...	...	...	...	...
401	Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
402	Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
403	Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
404	Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
405	Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

406 rows × 9 columns

Meta Information

We can also look at some basic information of our DataFrame.

df_cars.shape

(406, 9)

df_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Car           406 non-null    object 
 1   MPG           406 non-null    float64
 2   Cylinders     406 non-null    int64  
 3   Displacement  406 non-null    float64
 4   Horsepower    406 non-null    float64
 5   Weight        406 non-null    float64
 6   Acceleration  406 non-null    float64
 7   Model         406 non-null    int64  
 8   Origin        406 non-null    object 
dtypes: float64(5), int64(2), object(2)
memory usage: 28.7+ KB

If the dtype is object then it is in most cases a string. We can also look at the first or last data entries

df_cars.head()  # First 10
df_cars.head(5)  # First 5

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
0	Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
1	Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
2	Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
3	AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
4	Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US

df_cars.tail()  # Last 10
df_cars.tail(5)  # Last 5

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
401	Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
402	Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
403	Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
404	Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
405	Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

Accesing Data

df_cars.columns

Index(['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
       'Acceleration', 'Model', 'Origin'],
      dtype='object')

df_cars["Car"]

    Chevrolet Chevelle Malibu
            Buick Skylark 320
           Plymouth Satellite
                AMC Rebel SST
                  Ford Torino
                 ...            
            Ford Mustang GL
          Volkswagen Pickup
              Dodge Rampage
                Ford Ranger
                 Chevy S-10
Name: Car, Length: 406, dtype: object

This is also works but other is preferred as this can cause clashes with existing DataFrame attributes for example count

df_cars.Car

    Chevrolet Chevelle Malibu
            Buick Skylark 320
           Plymouth Satellite
                AMC Rebel SST
                  Ford Torino
                 ...            
            Ford Mustang GL
          Volkswagen Pickup
              Dodge Rampage
                Ford Ranger
                 Chevy S-10
Name: Car, Length: 406, dtype: object

You can also get Multiple columns.

df_cars[["Car", "MPG", "Cylinders"]]

	Car	MPG	Cylinders
0	Chevrolet Chevelle Malibu	18.0	8
1	Buick Skylark 320	15.0	8
2	Plymouth Satellite	18.0	8
3	AMC Rebel SST	16.0	8
4	Ford Torino	17.0	8
...	...	...	...
401	Ford Mustang GL	27.0	4
402	Volkswagen Pickup	44.0	4
403	Dodge Rampage	32.0	4
404	Ford Ranger	28.0	4
405	Chevy S-10	31.0	4

406 rows × 3 columns

The iloc function stands for integer location and allows us to easly access data by using there indexes.

df_cars.iloc[0]  # row with index 0

Car             Chevrolet Chevelle Malibu
MPG                                  18.0
Cylinders                               8
Displacement                        307.0
Horsepower                          130.0
Weight                             3504.0
Acceleration                         12.0
Model                                  70
Origin                                 US
Name: 0, dtype: object

We can also multiple rows and specify certain columns but as we are using iloc it has to be the indexes of the columns.

df_cars.iloc[[0, 1], [0, 2, 3]]  # row 0 and 1 with 3 columns

	Car	Cylinders	Displacement
0	Chevrolet Chevelle Malibu	8	307.0
1	Buick Skylark 320	8	350.0

We can also us slicing just like with python lists.

df_cars.iloc[0:5, 0:3]

	Car	MPG	Cylinders
0	Chevrolet Chevelle Malibu	18.0	8
1	Buick Skylark 320	15.0	8
2	Plymouth Satellite	18.0	8
3	AMC Rebel SST	16.0	8
4	Ford Torino	17.0	8

To access data there is also the loc function which works the same way as iloc but uses the labels not the indexes.

df_cars.loc[[0, 1], ["Car", "Cylinders"]]

	Car	Cylinders
0	Chevrolet Chevelle Malibu	8
1	Buick Skylark 320	8

df_index.loc[["First", "Second"]]

	name	age	designation
First	a	20	VP
Second	b	27	CEO

You can also use slices with loc important here is however that the end values is inclusive!

df_cars.loc[0:5, "Car":"Cylinders"]

	Car	MPG	Cylinders
0	Chevrolet Chevelle Malibu	18.0	8
1	Buick Skylark 320	15.0	8
2	Plymouth Satellite	18.0	8
3	AMC Rebel SST	16.0	8
4	Ford Torino	17.0	8
5	Ford Galaxie 500	15.0	8

df_cars["Cylinders"].value_counts()

  207
  108
   84
    4
    3
Name: Cylinders, dtype: int64

Indexes

As we have seen so far indexes are identifiers for rows. By default an index for a DataFrame or Series is just an integer sequence but as we have seen it can also be something different for example we can set a certain column as the index.

df_cars.set_index("Car")

	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
Car
Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US
...	...	...	...	...	...	...	...	...
Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

406 rows × 8 columns

We can however see that this change is not permenant. This is often the case in pandas when operating on Series or DataFrames. If we wanted these changes to be permenant we could call the method with the inplace parameter as True.

df_cars

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
0	Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
1	Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
2	Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
3	AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
4	Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US
...	...	...	...	...	...	...	...	...	...
401	Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
402	Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
403	Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
404	Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
405	Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

406 rows × 9 columns

df_cars.set_index("Car", inplace=True)
df_cars

	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
Car
Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US
...	...	...	...	...	...	...	...	...
Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

406 rows × 8 columns

You can also sort the rows by using the index

df_cars.sort_index()

	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
Car
AMC Ambassador Brougham	13.0	8	360.0	175.0	3821.0	11.0	73	US
AMC Ambassador DPL	15.0	8	390.0	190.0	3850.0	8.5	70	US
AMC Ambassador SST	17.0	8	304.0	150.0	3672.0	11.5	72	US
AMC Concord	19.4	6	232.0	90.0	3210.0	17.2	78	US
AMC Concord	24.3	4	151.0	90.0	3003.0	20.1	80	US
...	...	...	...	...	...	...	...	...
Volvo 145e (sw)	18.0	4	121.0	112.0	2933.0	14.5	72	Europe
Volvo 244DL	22.0	4	121.0	98.0	2945.0	14.5	75	Europe
Volvo 245	20.0	4	130.0	102.0	3150.0	15.7	76	Europe
Volvo 264gl	17.0	6	163.0	125.0	3140.0	13.6	78	Europe
Volvo Diesel	30.7	6	145.0	76.0	3160.0	19.6	81	Europe

406 rows × 8 columns

df_cars.reset_index(inplace=True)
df_cars

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
0	Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
1	Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
2	Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
3	AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
4	Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US
...	...	...	...	...	...	...	...	...	...
401	Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
402	Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
403	Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
404	Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
405	Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

406 rows × 9 columns

Pandas Intro

Series​

From List​

From Dictionary​

Custom indexes​

Iterating over Series​

DataFrame​

From dict of lists/numpy arrays​

From CSV​

Meta Information​

Accesing Data​

Indexes​

Series

From List

From Dictionary

Custom indexes

Iterating over Series

DataFrame

From dict of lists/numpy arrays

From CSV

Meta Information

Accesing Data

Indexes